ABSTRACT

We present 'dcGO' (http://supfam.org/SUPERFAMILY/dcGO),
a comprehensive ontology
database for protein domains. Domains are often 
the functional units of proteins, thus instead of 
associating ontological terms only with full-length 
proteins, it sometimes makes more sense to asso- 
ciate terms with individual domains. Domain-centric 
GO, 'dcGO', provides associations between onto- 
logical terms and protein domains at the superfam- 
ily and family levels. Some functional units consist 
of more than one domain acting together or acting 
at an interface between domains; therefore, onto- 
logical terms associated with pairs of domains, 
triplets and longer supra-domains are also 
provided. At the time of writing the ontologies in 
dcGO include the Gene Ontology (GO); Enzyme 
Commission (EC) numbers; pathways from 
UniPathway; human phenotype ontology and 
phenotype ontologies from five model organisms, 
including plants; anatomy ontologies from three or- 
ganisms; human disease ontology and drugs from 
DrugBank. All ontological terms have probabilistic 
scores for their associations. In addition to associ- 
ations to domains and supra-domains, the onto- 
logical terms have been transferred to proteins, 
through homology, providing annotations of >80 
million sequences covering 2414 complete 
genomes, hundreds of meta-genomes, thousands 
of viruses and so forth. The dcGO database is 
updated fortnightly, and its website provides down- 
loads, search, browse, phylogenetic context and 
other data-mining facilities. 



INTRODUCTION 

Scientists are increasingly confronted with the grand
challenge: how to convert sequenced genome information into
higher-order knowledge on function (1), phenotype (2)
and even human disease (3).

The domain-centric Gene Ontology (dcGO) database at 
http://supfam.org/SUPERFAMILY/dcGO is a compre- 
hensive ontology resource that contributes to the afore- 
mentioned challenge through a new domain-centric 
strategy. Our method, dcGO (4), annotates protein 
domains with ontological terms. Ontologies are hierarch- 
ically organized controlled vocabularies/terms defined to 
categorize a particular sphere of knowledge (5). For 
example, 'Gene Ontology' (GO) was created to describe 
functions of proteins (6). Ontological labels are already 
available for full-length proteins, derived from experimen- 
tal data for that protein. The dcGO approach takes the 
terms attached to full-length sequences, and combines 
them with the domain composition of the sequences on 
a large scale, to statistically infer for each term which 
domain is the functional unit responsible for it. The 
method has been formulated in a general way, enabling 
it to be applied to numerous ontologies. The dcGO 
database now contains a panel of ontologies from a 
variety of contexts: functions such as GO (6,7), enzymes 
(8), pathways (9) and keywords used by UniProt (10); 
phenotype and anatomy ontologies across major model 
organisms, including mouse (11), worm (12), yeast (13), 
fly (14), zebrafish (15), Xenopus (16) and Arabidopsis (17); 
human phenotypes (18), diseases (19) and drugs (20). In 
addition to complete sets of ontological terms, a collapsed 
subset (slim version) is also provided for each ontology. 
The automatically generated slim version of each ontology 
is based on annotation frequency (4), and it provides the 
user with a manageable and more coarse-grained list. This 
is analogous to the 'GO slim' provided by the Gene 
Ontology consortium, which has proven useful for enrich- 
ment analyses (21). 

The domain definitions used in dcGO are taken from 
the structural classification of proteins (SCOP) (22) clas- 
sified at both the superfamily and family levels. SCOP 
groups domains at the superfamily level if there is struc- 
ture, sequence and function evidence for a common evo- 
lutionary ancestor. Some superfamilies are sub-divided 
into families, which often share a higher sequence 
similarity and a related function. In addition to individual
domains at these two different levels, dcGO also offers 
annotations for combinations of domains. We use the 
concept of supra-domains to describe combinations of 
two or more successive domains of known structure. In 
addition to providing ontology for SCOP domains, the 
generality of the method has enabled us to also include 
Pfam (23) domains in dcGO. 

Our domain-centric ontology, derived from proteins
with experimental evidence, can in turn be used as
a predictor for proteins of unknown function whose
domain content is known. The 'dcGO Predictor'
provides pre-computed functional annotation [using 
SUPERFAMILY hidden Markov models (24)] of all se- 
quences in UniProt (25), 2414 completely sequenced 
genomes, thousands of viral genomes and hundreds of 
meta-genomes. The dcGO website also has a facility for 
the user to submit their own sequences for function pre- 
diction. The dcGO Predictor took part in the recent 
Critical Assessment of Functional Annotation experiment 
(http://biofunctionprediction.org); hence, for comparison 
with other non-domain-centric predictors, we refer the 
reader to this independent evaluation of its performance. 
A key result from the experiment, however, was that 
dcGO performs significantly better than the most 
commonly used method for GO annotation, Basic Local 
Alignment Search Tool searches against UniProt (25). 

In the remainder of this article, we describe the
database contents in detail, doing so separately for GO 
and for other biomedical ontologies (collectively denoted 
as 'BO' hereinafter). Then, we provide an overview of 
various utilities available through the website that may 
interest users. Finally, we conclude with planned future 
developments. 

DATABASE CONTENTS 

Algorithm summary 

To understand the content of the dcGO database
(Table 1), it helps to first describe the algorithm
that is used to build it. Without loss of generality,
we use the GO as the sole example; in principle,
the method applies to other ontologies in the same way. For
more detail, the reader is referred to a previous
publication of the algorithm (4).

The GO is designed to annotate full-length proteins in a 
species-independent manner for generality. The most com- 
prehensive protein-level annotations are maintained by 
the Gene Ontology Annotation (GOA) project (7). 
Motivated by a domain-centric viewpoint, we developed 
a general algorithm (4) for revealing functional signals 
carried by protein domains (and supra-domains in the 
multi-domain proteins). Using protein domain architec- 
tures from SUPERFAMILY and protein GO annotations 
from UniProtKB-GOA (respecting the GO hierarchy), we
first prepare a correspondence matrix between domains/ 
supra-domains and GO terms. Each entry has the 
observed number of UniProt proteins that contain a 
domain/supra-domain (columns) and that can be 
annotated by a GO term (rows). With this correspondence
matrix, we then use Fisher's exact test to infer associations
between the rows and columns. On top of this, we take 
advantage of the true path rule of the directed acyclic 
graph of the GO to determine the optimal level at which 
to make an association. We achieve this by comparing the 
significance of each term using two different backgrounds, 
one background using all analysable UniProt proteins 
(being annotatable by the GO), and one background 
using only those UniProt proteins annotated to direct 
parents of the term. If a GO term and its parent term 
are both significantly associated with a domain/ 
supra-domain using the first background, and if the term 
is not significantly different from the parent term using the 
second background, then it is desirable to only associate 
the parent term. As a result of these dual constraints, only 
the most significant GO term associations to domains/ 
supra-domains will be retained. 
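
The two-background comparison described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the dcGO implementation: it assumes a one-sided (enrichment) form of Fisher's exact test, applies a hypothetical threshold `alpha` directly to raw P-values rather than to FDR-adjusted ones, and uses invented toy counts. The function names are illustrative only.

```python
from math import comb


def fisher_enrichment_p(k, n_domain, n_term, n_total):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    k        -- proteins that contain the domain AND carry the term
    n_domain -- proteins that contain the domain
    n_term   -- proteins that carry the term
    n_total  -- all proteins in the chosen background
    """
    upper = min(n_domain, n_term)
    tail = sum(
        comb(n_term, i) * comb(n_total - n_term, n_domain - i)
        for i in range(k, upper + 1)
    )
    return tail / comb(n_total, n_domain)


def keep_child_term(child_all_p, parent_all_p, child_vs_parent_p, alpha=0.05):
    """Return True if the child term should be retained for the domain.

    child_all_p       -- child term tested against all annotatable proteins
    parent_all_p      -- parent term tested against the same background
    child_vs_parent_p -- child term tested against proteins annotated to
                         its direct parent (the second background)
    alpha             -- hypothetical significance threshold (assumption)
    """
    if child_all_p >= alpha:
        return False  # child is not significant in the first background
    if parent_all_p < alpha and child_vs_parent_p >= alpha:
        return False  # child is not distinguishable from its parent
    return True


# Toy counts (invented): background 1 holds 1000 proteins, background 2
# holds the 50 proteins annotated to the parent term.
p_child_all = fisher_enrichment_p(k=10, n_domain=30, n_term=20, n_total=1000)
p_parent_all = fisher_enrichment_p(k=25, n_domain=30, n_term=50, n_total=1000)
p_child_vs_parent = fisher_enrichment_p(k=10, n_domain=25, n_term=20, n_total=50)

retain_child = keep_child_term(p_child_all, p_parent_all, p_child_vs_parent)
```

In this toy case the child term is enriched against all proteins but is no more frequent among domain-containing proteins than expected within the parent's annotations, so only the parent term would be associated.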

The significance of association is assessed by the method 
of false discovery rate (FDR) to account for multiple 
hypothesis tests, whereas the strength of association is 
measured by a hypergeometric distribution-based score. 
For a domain/supra-domain, the associated GO terms 
(i.e. direct annotations) are propagated to all ancestor 
terms (i.e. inherited annotations); together they constitute 
a complete GO annotation profile. Based on the informa- 
tion content of a GO term (i.e. negative logarithmic trans- 
formation of frequency of domains/supra-domains 
annotated to that term), a search procedure is applied to 
partition the directed acyclic graph structure of the GO, 
each partition reflecting the same or similar specificity but 
located in distinct paths. With four seeds of increasing 
information content, the procedure produces the 'GO 
slim' that contains GO terms classified into four levels of 
increasing granularity. These are highly general, general, 
specific and highly specific (Supplementary Table S1). The
use of information content (as a measure of how specific 
and informative a term is) adds great value to the existing 
GO hierarchy for the user. The GO was created for 
annotating proteins, so some parts of GO structure are 
less valid for annotating domains/supra-domains than 
others. Rather than merely relying on the ontology 
graph depth to define the term specificity, our approach 
has taken into account actual usage of terms when 
determining the four-level depth classification of 
domains/supra-domains. 
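
As an illustration of the propagation and information content steps described above, the following sketch builds a complete annotation profile for each domain and then scores each term. It is a simplified example under assumptions: the ontology is given as a hypothetical child-to-parents mapping, the frequency is taken over all annotated domains, and the natural logarithm is used (the base is not specified here). The seed-based partitioning that yields the four slim levels is not reproduced.

```python
from math import log


def ancestors(term, parents):
    """All ancestors of `term` in a DAG given as {child: [parent, ...]}."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def annotation_profile(direct_terms, parents):
    """Direct annotations plus inherited (ancestor) annotations."""
    profile = set(direct_terms)
    for term in direct_terms:
        profile |= ancestors(term, parents)
    return profile


def information_content(profiles):
    """IC(term) = -log(fraction of domains annotated to the term)."""
    n = len(profiles)
    counts = {}
    for terms in profiles.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return {term: -log(count / n) for term, count in counts.items()}


# Toy ontology (invented): D has two parents, B and C, which share root A.
toy_parents = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
toy_direct = {"dom1": {"D"}, "dom2": {"B"}, "dom3": {"C"}, "dom4": {"A"}}

toy_profiles = {
    dom: annotation_profile(terms, toy_parents)
    for dom, terms in toy_direct.items()
}
toy_ic = information_content(toy_profiles)
```

The root term is annotated to every domain and therefore has zero information content, whereas the most specific term `D` has the highest value.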

Domain-centric GO 

Using the algorithm described earlier in the text, dcGO 
provides the user with two alternative versions of the GO 
associations with domains (Table 1). The high-quality 
version of associations includes only those that are sup- 
ported by the unambiguous evidence (in terms of the 
causal domain) that comes from single-domain proteins 
of known function. The high-coverage associations also 
include those that are supported through statistical disam- 
biguation from multi-domain proteins of known function. 
The high-quality associations are more reliable in their 
domain-centricity, but high-coverage associations are 
reliable enough for large-scale studies and provide a 
much greater coverage of function. Enrichment analyses 
are improved more by annotation coverage than by annotation
quality; hence, there is a strong justification for
using the high-coverage version in such studies. 
Restricting the annotations to GO slim (described earlier 
in the text) is also highly recommended for domain-based 
enrichment analyses. 

In addition to individual domains, dcGO also associates 
GO terms with supra-domains (Table 1). In general, 
supra-domains are defined as recurring combinations of 
two or more successive domains that could function 
together. In dcGO, we only include completely assigned 
supra-domains, without any significant gaps between 
domains, that is, supra-domains with regions not 
assigned to a known domain are excluded (26). GO asso- 
ciations to supra-domains hold great promise for under- 
standing how domain combinations contribute to 
functional diversification and also in predicting the func- 
tions of multi-domain proteins. 
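
The definition of completely assigned supra-domains can be made concrete with a short sketch. It is an illustration only and rests on assumptions that are not taken from the dcGO pipeline: a domain architecture is an ordered list of domain labels, an unassigned region is represented by a hypothetical `GAP` marker, and 'recurring' is read as 'observed in at least `min_proteins` proteins'.

```python
from collections import Counter

GAP = None  # hypothetical marker for a region not assigned to a known domain


def supra_domains(architecture):
    """Yield every run of two or more successive domains with no gap inside."""
    n = len(architecture)
    for start in range(n):
        for end in range(start + 2, n + 1):
            window = architecture[start:end]
            if GAP in window:
                break  # longer windows from this start also span the gap
            yield tuple(window)


def recurring_supra_domains(architectures, min_proteins=2):
    """Keep combinations seen in at least `min_proteins` distinct proteins."""
    counts = Counter()
    for architecture in architectures:
        counts.update(set(supra_domains(architecture)))
    return {combo for combo, count in counts.items() if count >= min_proteins}


# Toy architectures (invented labels).
toy_architectures = [
    ["a", "b", "c"],
    ["b", "c"],
    ["a", GAP, "b"],
]
toy_recurring = recurring_supra_domains(toy_architectures)
```

Here only the pair `("b", "c")` recurs; the pair `("a", "b")` in the third architecture is never produced because a gap separates the two domains.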

Domain-centric BO 

In dcGO, the 'BO' refers to all other Biomedical 
Ontologies that are not GO. They mainly consist of 
phenotype ontologies that have been developed to 
classify and organize information on model organisms 
and human. Similarly to GO, the BO is hierarchical 
going from general terms at the top to more specific 
terms at the bottom. As with domain-centric GO, dcGO 
has the associations of the BO terms to individual 
domains and supra-domains; each has its own slim 
version of the ontology at four levels of increasing granu- 
larity based on information content. Unlike the GO, the 
BO does not have the high-quality version of the associ- 
ations. This is largely because of an insufficient number of 
single-domain proteins with annotations, especially for 
species-specific ontologies. As listed in Table 1, currently 
dcGO has eight phenotype and/or anatomy ontologies 
covering seven major model organisms. They include 
Mouse/Mammalian Phenotypes (MP) from Mouse 
Genome Informatics (MGI) (11), Worm Phenotypes 
(WP) from WormBase (12), Yeast/Ascomycete 
Phenotype (YP) from Saccharomyces Genome Database 
(SGD) (13), Fly Phenotype (FP) and Fly Anatomy (FA) 
from FlyBase (14), Zebrafish Anatomy (ZA) from ZFIN 
(15), Xenopus Anatomy (XA) from Xenbase (16) and 
Arabidopsis Plant (AP) ontology from TAIR (17). In 
addition to model organisms, dcGO also contains three 
ontologies with specific relevance to humans, including 
Human Phenotype (HP) (18), Disease Ontology (DO) 
(19) and DrugBank ATC codes (DB). The remaining 
ontologies have a fixed-length or much-simplified hier- 
archy. These include Enzyme Commission (EC) (8), 
UniProtKB UniPathway (UP) (9) and UniProtKB 
Keywords (KW) (10). 

DATABASE WEBSITE 

Downloading data 

The underlying data summarized in Table 1 are available 
for download on the dcGO website. For each ontology, 
the full and slim versions are provided separately for
individual domains (i.e. superfamilies and families) and
supra-domains. In addition, the user can download the 
MySQL relational database tables along with detailed 
documentation. All downloadable files are free for 
academic or commercial use and are automatically 
updated fortnightly. 

Searching dcGO 

The faceted search on the dcGO website (Figure 1) is a 
mining hub for users, with additional bioinformatics tools 
hyperlinked from the search results. Full-text query is sup- 
ported for SCOP domains, ontologies and genomes. 
Identifier or accession number lookup is supported for 
sequences. Ontologies and SCOP domains are linked to 
pages for browsing their respective hierarchies. Every 
genome is presented within its phylogenetic context by 
linking to a species tree of life (called sTOL, see 
'Analysing GO terms over the species tree of life' 
section). There are also links from domains and onto- 
logical terms to the tree of life (to see their distribution 
across species). Search results returning BO terms are 
linked to a cross-ontology comparison tool, the phenotype 
similarity network (PSnet, see 'Cross-linking similar 
phenotypes' section). PSnet searches for terms from 
other ontologies with a similar profile of associations. 
For lookups returning a specific genome sequence, the 
user is provided with the facility to submit it automatically 
to the 'dcGO Predictor' for function, phenotype and 
disease prediction. In short, the faceted search is
designed for multi-tasking; it does not just provide 
search results but is intended to interconnect all the 
tools and cross-referencing abilities of dcGO. 

Browsing the hierarchies 

The 'BROWSE' navigation on the aforementioned
website provides browsing for the SCOP, GO and
various BO hierarchies. The hierarchy-like structure of
the SCOP (or ontology) has a domain (or term) as a 
node and its relations to parental nodes as directed 
edges. To navigate this hierarchy, we display all the 
paths from the current node upwards to the root 
ordered by the shortest distances. Also, all direct 
children of the current node are listed underneath to 
enable browsing downwards. In addition to the hierarchy 
itself, a tabbed interface is used to aid the display of 
domain-centric annotations in a subject-specific manner. 
The SCOP-orientated hierarchy shows terms used to 
annotate a domain, and vice versa, the ontology- 
orientated hierarchy shows domains/supra-domains 
annotated by a term. 
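
The upward navigation described above amounts to enumerating all paths from a node to the root of a directed acyclic graph. The sketch below shows one way to do this; it assumes the hierarchy is available as a hypothetical node-to-parents mapping and interprets 'ordered by the shortest distances' as ordering paths by their length.

```python
def paths_to_root(node, parents):
    """All upward paths from `node` to a root, shortest first.

    `parents` maps each node to a list of its direct parents; a node with
    no parents is treated as a root.
    """
    ups = parents.get(node, [])
    if not ups:
        return [[node]]
    paths = [[node] + rest for up in ups for rest in paths_to_root(up, parents)]
    return sorted(paths, key=len)


def children_of(node, parents):
    """Direct children of `node`, for browsing downwards."""
    return sorted(child for child, ups in parents.items() if node in ups)


# Toy hierarchy (invented): C is reachable from the root directly and via A, B.
toy_hierarchy = {"root": [], "A": ["root"], "B": ["A"], "C": ["root", "B"]}

toy_paths = paths_to_root("C", toy_hierarchy)
toy_children = children_of("root", toy_hierarchy)
```

For node `C` the direct path to the root is listed before the longer path through `B` and `A`.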

Analysing GO terms over the species tree of life 

The dcGO website is integrated with a species tree of life 
(called sTOL), which is provided by SUPERFAMILY 
(27). The sTOL is a fully resolved binary tree of species 
of completely sequenced organisms providing a phylogen- 
etic context. Within the sTOL, the presence/absence of 
domains and supra-domains are pre-computed and 
stored, both for extant genomes and for reconstructed an- 
cestral genomes in eukaryotes. The integration enables 
[Figure 1 screenshot: the 'Faceted Search dcGO' help panel. Searchable
facets are SCOP domains (by SCOP unique identifier, e.g. 46458, or by
name, e.g. hemoglobin), ontologies (GO identifier, e.g. GO:0019827; GO
term; BO term, e.g. immune system cancer), genomes (common organism
name, e.g. human, or genome name, e.g. Homo sapiens) and sequences
(identifier or accession number, or a batch query via dcGO Predictor).
Common words are removed from the query; for multi-word searches an
entire-phrase (AND) search is run first and, if nothing is found, an
any-word (OR) search is used instead. Results link to the SCOP, GO and
BO hierarchies, PSnet, sTOL and dcGO Predictor.]

Figure 1. The dcGO website has the 'Faceted Search' interface as a hub to mine the resource. By searching against keywords of interest, the user can 
access the resource in an organized manner and can link to additional analysis tools. 



cross-comparison between both resources for understand- 
ing the evolutionary context of the functional associations. 
The distribution of GO terms can be explored over the 
sTOL tree, which can be accessed either by links from 
the GO hierarchy pages or from the faceted search. GO 
term enrichment for extant and ancestral genomes (using 
domain-based GO enrichment analysis) can also be 
explored when browsing the tree. Thus, the sTOL adds 
an extra dimension to the dcGO resource utility. 

Cross-linking similar phenotypes 

Traditionally, phenotype ontologies are developed for 
within-species comparisons. Recently, many attempts 
have been made at cross-species comparisons (28-30), 
these studies mostly focusing on text mining and formal 
definitions. Within dcGO is a tool called 'PSnet' that 
cross-references terms between ontologies (mostly 
phenotypes in different model organisms). Given one 
phenotype, PSnet can be used to search for a similar can- 
didate phenotype on the basis of their shared domain an- 
notations (both at superfamily and family levels). The 
statistical significance of the shared domains versus the 
expected overlap by chance is evaluated by Fisher's 
exact test. PSnet reports a Z-score for the strength of 
the overlap, and a P-value and false discovery rate
(accounting for multiple hypothesis tests) for the significance.
An information content-based similarity metric is
used to rank the phenotype similarities; if a certain 
domain is more frequently annotated (less informative) 
than others, then its contribution to the phenotype simi- 
larity is less. In this way, for any given phenotype, PSnet 
will suggest the best-correlated terms from other 
ontologies. 
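
The quantities reported by PSnet can be illustrated with a small sketch. It is not the PSnet implementation, and several choices are assumptions made for the example: the Z-score is computed from the hypergeometric mean and variance of the overlap, the false discovery rate uses the Benjamini-Hochberg procedure, and the similarity metric is a stand-in (an information content-weighted Jaccard index) chosen only because it down-weights frequently annotated domains.

```python
from math import comb, log, sqrt


def overlap_stats(domains_a, domains_b, n_domains):
    """Z-score and one-sided P-value for the overlap of two domain sets.

    Assumes both sets are non-empty proper subsets of the n_domains
    annotatable domains, so that the variance is positive.
    """
    k = len(domains_a & domains_b)
    a, b, n = len(domains_a), len(domains_b), n_domains
    mean = a * b / n
    var = a * b * (n - a) * (n - b) / (n * n * (n - 1))
    z = (k - mean) / sqrt(var)
    tail = sum(comb(b, i) * comb(n - b, a - i) for i in range(k, min(a, b) + 1))
    return z, tail / comb(n, a)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def weighted_similarity(domains_a, domains_b, domain_freq):
    """Stand-in metric: Jaccard index weighted by -log(annotation frequency)."""
    shared = sum(-log(domain_freq[d]) for d in domains_a & domains_b)
    union = sum(-log(domain_freq[d]) for d in domains_a | domains_b)
    return shared / union if union else 0.0


# Toy data (invented): 10 annotatable domains, two phenotype terms.
toy_a, toy_b = {1, 2, 3}, {2, 3, 4}
toy_freq = {1: 0.5, 2: 0.1, 3: 0.1, 4: 0.5}

toy_z, toy_p = overlap_stats(toy_a, toy_b, n_domains=10)
toy_fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
toy_sim = weighted_similarity(toy_a, toy_b, toy_freq)
```

With this weighting, the two shared domains are rarely annotated and therefore dominate the similarity, whereas the two unshared, frequently annotated domains reduce it only slightly.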

As a proof of principle, we consider the disease term 
'immune system cancer' [DOID: 0060083] from the 
Disease Ontology (19). Figure 2 uses this example to
illustrate how PSnet displays cross-links between
correlated terms, which facilitates the development of
hypotheses. In dcGO, 10 superfamilies and 13 families 
are associated with this term (Figure 2A). Supplementary
Figure S1B shows the numerical steps for associating
immune system cancer with the immunoglobulin superfamily,
following the general procedure shown in
Supplementary Figure S1A. Given this disease and its
domain-centric annotation profile in Figure 2A, PSnet 
searches for phenotypes with similar domain annotations. 
As shown in Figure 2B, PSnet cross-references this disease 
term with closely related terms from the Human 
Phenotype ontology (18), suggests possible links to the 
abnormal counterparts in the Mouse Phenotype 
ontology (11), reveals the mechanisms by listing top 
[Figure 2 screenshot content, reconstructed]

A  Annotations by immune system cancer

Superfamily:
  Inhibitor of apoptosis (IAP) repeat
  Immunoglobulin
  4-helical cytokines
  SH2 domain
  SET domain
  TNF-like
  HD-domain/PDEase-like
  Bcl-2 inhibitors of programmed cell death
  GST C-terminal domain-like
  Homeodomain-like

Family:
  RUNT domain
  Inhibitor of apoptosis (IAP) repeat
  Short-chain cytokines
  SH2 domain
  V set domains (antibody variable domain-like)
  A DNA-binding domain in eukaryotic transcription factors
  TNF-like
  PDEase
  FHA domain
  Bcl-2 inhibitors of programmed cell death
  Glutathione S-transferase (GST), N-terminal domain
  BCR-homology GTPase activation domain (BH-domain)
  Glutathione S-transferase (GST), C-terminal domain

B  Top correlated terms per ontology

Term                                              Z-score  P-value   FDR       Similarity metric

Human Phenotype (HP)
Hematological neoplasm                            19.56    5.92e-10  5.54e-09  0.1253
Neoplasm by anatomical site                       14.68    9.52e-09  4.84e-08  0.1013

Mouse Phenotype (MP)
abnormal B cell physiology                        22.01    0         0         0.1893
abnormal lymphocyte physiology                    22.03    0         0         0.1824
abnormal immune system organ morphology           20.70    0         0         0.1619
abnormal bone marrow cell morphology/development  19.24    0         0         0.1491

Enzyme Commission (EC)
Methylarsonate reductase                          24.07    1.75e-09  1.40e-08  0.1316
Glutathione dehydrogenase (ascorbate)             20.81    5.53e-09  3.04e-08  0.1267
Carbon-halide lyases                              20.81    5.53e-09  3.04e-08  0.1267
Maleylacetoacetate isomerase                      20.81    5.53e-09  3.04e-08  0.1267

DrugBank ATC (DB)
antineoplastic and immunomodulating agents        17.55    2.44e-14  3.94e-13  0.1391
antineoplastic agents                             17.52    2.08e-13  2.78e-12  0.1424
platelet aggregation inhibitors excl. heparin     17.20    1.98e-10  1.97e-09  0.1470
selective immunosuppressants                      19.56    5.92e-10  5.54e-09  0.1386

Figure 2. Using 'PSnet' to cross-link phenotypes and other ontologies based on shared domain-centric annotations. (A) A list of superfamilies and 
families annotated by a disease term 'immune system cancer'. (B) The top well-correlated ontological terms are returned for the disease term in this 
query. 



enzymes from the Enzyme Commission (8) and implicates 
the treatment agents through DrugBank ATC codes (20). 
The multiple layers of information revealed by PSnet 
provide a powerful tool for hypothesis generation. PSnet 
brings additional understanding to the essential roles that 
protein domains can play in functions, phenotypes and 
diseases. 

Predicting functions, phenotypes and diseases for 
>80 million sequences 

Using domain-centric GO annotations as a functional pre- 
dictor, we entered the Critical Assessment of Function 
Annotation competition and came in the top 10 of >50 
methods. Considering that only domain information is 
involved and natively used as a single direct prediction, 
its relative success validates the quality of this resource for 
widespread use. We provide pre-computed annotations 
for >80 million sequences (at the time of writing) stored 
in the SUPERFAMILY database that includes 2414 
genomes, UniProt and hundreds of meta-genomes. 
Through the dcGO Predictor (Figure 3), functions and 
other higher-order knowledge (phenotypes, diseases and 
more) can be predicted for user-submitted sequences. 
The implementation is fairly straightforward: first the 
domain architecture of the query protein is determined, 
and then the ontological terms associated with its compo- 
nent domains/supra-domains are transferred to the query 
protein. A score is provided for ranking the confidence of 
such predictions/transfers. In addition to access through 
the faceted search, a batch query mode is provided, which 
allows the submission of up to 1000 sequences at a time 
(Figure 3A). The prediction results are summarized to give 
an overview of the prediction content and are also avail- 
able for download (Figure 3B). Figure 3C shows the 
results for the example input sequence 'Q01826', i.e. 
special AT-rich sequence-binding protein 1 (SATB1). As
a chromatin regulator, this protein has been reported to
promote tumour growth and metastasis (31), which is 
consistent with the prediction. For a sequence to receive 
annotation, first there must be domains detected by the 
SUPERFAMILY hidden Markov model library search, 
and then those domains/supra-domains must have onto- 
logical associations in dcGO. Our coverage will improve 
as new structures are deposited in the Protein Data Bank 
(32) and as more sequences have ontological terms experi- 
mentally determined. 
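
The two-step prediction described above (determine the domain architecture, then transfer terms from its component domains/supra-domains) can be sketched as follows. This is a minimal illustration with assumptions: the association table is a hypothetical in-memory mapping, the domain labels and scores are invented, and when several units support the same term the highest score is kept, which is a simplification rather than the scoring used by the dcGO Predictor.

```python
def predict_terms(architecture, associations):
    """Transfer ontology terms from domains/supra-domains to a query protein.

    architecture -- ordered list of domain identifiers found in the query
    associations -- {unit: {term: score}}, where a unit is a tuple holding
                    one domain or several successive domains
    Returns {term: best score over all units present in the query}.
    """
    n = len(architecture)
    units = {
        tuple(architecture[i:j]) for i in range(n) for j in range(i + 1, n + 1)
    }
    predictions = {}
    for unit in units:
        for term, score in associations.get(unit, {}).items():
            predictions[term] = max(score, predictions.get(term, 0.0))
    return predictions


def ranked(predictions):
    """Terms ordered from most to least confident."""
    return sorted(predictions.items(), key=lambda item: (-item[1], item[0]))


# Toy association table (invented labels and scores).
toy_associations = {
    ("sfA",): {"term1": 0.4},
    ("sfA", "sfB"): {"term1": 0.9, "term2": 0.7},
    ("sfC",): {"term3": 1.0},
}

toy_predictions = predict_terms(["sfA", "sfB"], toy_associations)
toy_ranking = ranked(toy_predictions)
```

A query without any detected domain, or whose domains have no associations, receives an empty prediction set, which mirrors the two conditions for annotation stated above.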



CONCLUSION AND FUTURE DEVELOPMENTS 

With the rate of growth of biological (e.g. sequence) data 
increasing rapidly, the only realistic way to analyse biology 
in a holistic way is computationally. Gene Ontology has 
become a widely adopted medium for handling biological 
concepts in a structured way that can be processed compu- 
tationally. With this unique database 'dcGO', treating 
domains and supra-domains as functional units, we 
provide GO plus a growing number of other ontologies in 
a probabilistic framework. The results of the Critical 
Assessment of Function Annotation experiment show 
that this domain-centric approach performs significantly 
better than simple whole-sequence pair-wise homology on 
the task of labelling sequences of unknown function with 
GO terms; by 'simple whole-sequence homology', we mean 
the strategy of annotating a sequence with GO terms by 
searching it against UniProt using Basic Local Alignment 
Search Tool and transferring any GO terms associated with 
significant hits. Thus, the dcGO database, in providing full 
functional annotations of all completely sequenced 
genomes in addition to the domain-ontology associations 
themselves, makes a massive contribution to the body of 
computer-readable biological knowledge. In the future, 
the intention is to expand the ontologies included in the 
[Figure 3 screenshot content, summarized]

A  dcGO Predictor (Batch Query)

Step 1: select ontologies (Functions, Diseases, Phenotypes and Others).
Step 2: choose the input sequence format (sequence identifier or amino
acid sequence).
Step 3: paste or upload sequences; the example input lists sequence
identifiers, one per line, beginning with Q01826.
Step 4: run the dcGO Predictor.

Selectable ontologies:
  Functions:  Gene Ontology (GO)
  Diseases:   Disease Ontology (DO)
  Phenotypes: Human Phenotype (HP), Mouse Phenotype (MP), Worm
              Phenotype (WP), Yeast Phenotype (YP), Fly Phenotype (FP),
              Fly Anatomy (FA), Zebrafish Anatomy (ZA), Xenopus Anatomy
              (XA), Arabidopsis Plant Ontology (AP)
  Others:     Enzyme Commission (EC), DrugBank ATC (DB), UniProtKB
              KeyWords (KW), UniProtKB UniPathway (UP)

B  Summary of Disease Ontology (DO) predictions

Amongst 20 sequences in the query, 18 are predictable (i.e. they contain
domains/supra-domains that are annotated by at least one DO term). DO
terms annotated to >15% of predictable sequences:

DO term                      SDDO (four levels)  #Sequences (%)
organ system cancer          Highly general      5 (28%)
cell type cancer             General             9 (50%)
benign neoplasm              General             3 (17%)
genetic disease              General             3 (17%)
immune system cancer         General             3 (17%)
autosomal recessive disease  Specific            3 (17%)
carcinoma                    Specific            3 (17%)
leukemia                     Specific            3 (17%)
ataxia telangiectasia        Highly specific     3 (17%)
lymphoblastic leukemia       Highly specific     3 (17%)

C  Predictions for Q01826

Domain architecture (superfamily level, labels as displayed):
'Lambda repressor-like DNA' and 'Homeodomain-like'.

DO term                 SDDO (four levels)  Predictive score
organ system cancer     Highly general      1.000
immune system cancer    General             1.000
leukemia                Specific            1.000
lymphoblastic leukemia  Highly specific     1.000

Figure 3. Converting genome sequences to knowledge about function, phenotype and disease using the 'dcGO Predictor'. (A) A batch query facility 
allows the user to upload up to 1000 sequences for the prediction on function, disease, phenotype and other information, such as enzyme classi- 
fication, drugs and pathways. (B) The result page provides a summary of the prediction content. New predictions are supported by instantly 
switching to other ontologies. In addition to the download, the user can also explore predictions for each of the input sequences, such as 
Q01826 (human SATB1 protein; see next). (C) The domain architecture of the human SATB1 protein is graphically displayed using the SCOP 
domains at the superfamily level, whereas the bottom panel shows the predicted Disease Ontology terms. 
database, expand the domain collection as more domains
are classified, and expand the collection of functionally 
annotated sequences as new genomes, meta-genomes and 
so forth are released. 

In addition to the value of the large-scale raw annota- 
tions in the dcGO database, the anticipated potential for 
comparative analyses is already reflected in the sTOL
evolutionary context and PSnet cross-referencing tools. Beyond
the data expansion described above, future developments
will focus on introducing more comparative
tools and increasing the use cases of the existing ones. 
These will include network-based infrastructures, for 
example, of domains and of terms spanning different 
ontologies. The construction of functional domain 
networks with respect to GO is already on the agenda. 



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Table 1 and Supplementary Figure 1. 